Introduction to R Programming

Part II: Data Manipulation

Welcome

Welcome to the data manipulation part of the Intro to R Programming Workshop!

Data frames

Data frames are the main R object that we will be interacting with. In many ways you already know about them too.

An example of a data frame is the table from the Animal Ageing and Longevity Database we saw earlier.

Animal                   Maximum Longevity (in years)
Human                    122.5
Domestic dog              24.0
Domestic cat              30.0
American alligator        77.0
Golden hamster             3.9
King penguin              26.0
Lion                      27.0
Greenland shark          392.0
Galapagos tortoise       177.0
African bush elephant     65.0
California sea lion       35.7
Fruit fly                  0.3
House mouse                4.0
Giraffe                   39.5
Wild boar                 27.0

human_lifespan <- 122.5
dog_lifespan <- 24
lion_lifespan <- 27
mouse_lifespan <- 4
fly_lifespan <- 0.3
boar_lifespan <- 27
alligator_lifespan <- 77
greenland_shark_lifespan <- 392
galapagos_tortoise_lifespan <- 177

animal_lifespans <- c(greenland_shark_lifespan, dog_lifespan, 
  galapagos_tortoise_lifespan,
  mouse_lifespan, fly_lifespan,
  lion_lifespan, boar_lifespan, 
  alligator_lifespan, human_lifespan)

animals <- c("greenland_shark", "dog", 
  "galapagos_tortoise", "mouse", 
  "fly", "lion", "boar",
  "alligator", "human")

To create a data frame from scratch we can simply pass two (same-sized) vectors to the function data.frame.

data.frame(animals, animal_lifespans)
##              animals animal_lifespans
## 1    greenland_shark            392.0
## 2                dog             24.0
## 3 galapagos_tortoise            177.0
## 4              mouse              4.0
## 5                fly              0.3
## 6               lion             27.0
## 7               boar             27.0
## 8          alligator             77.0
## 9              human            122.5

We can also assign a data frame to a name so that we can reuse it later.

animals_data <- data.frame(animals, animal_lifespans)

animals_data
##              animals animal_lifespans
## 1    greenland_shark            392.0
## 2                dog             24.0
## 3 galapagos_tortoise            177.0
## 4              mouse              4.0
## 5                fly              0.3
## 6               lion             27.0
## 7               boar             27.0
## 8          alligator             77.0
## 9              human            122.5
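By default the columns take their names from the vectors we pass in. As a minimal sketch, data.frame() also accepts name = value pairs, which lets us choose the column names directly. The names animal and lifespan below are made up for illustration.

```r
# Hypothetical column names (animal, lifespan), chosen only for this example
animals_named <- data.frame(
  animal   = c("dog", "mouse", "human"),
  lifespan = c(24, 4, 122.5)
)

# The column names are the names we supplied
names(animals_named)
```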

Data Dimensions

We can use functions to determine the shape of our data.

How many columns does the data have?

We can simply use the function ncol() to determine the number of columns.

ncol(animals_data)
## [1] 2

How many rows does the data have?

Run nrow() to determine the number of rows.

nrow(animals_data)
## [1] 9

dim()

We can also use dim() to get the same information in one call:

dim(animals_data)
## [1] 9 2

The first value is the number of rows, the second value is the number of columns.
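A small sketch of how the three functions relate, using a shortened stand-in data frame (small_data is a made-up name, not an object from the workshop):

```r
# A three-row stand-in for animals_data
small_data <- data.frame(
  animals          = c("dog", "mouse", "human"),
  animal_lifespans = c(24, 4, 122.5)
)

shape <- dim(small_data)

# dim() returns rows first, then columns
shape[1] == nrow(small_data)
shape[2] == ncol(small_data)
```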

Variable Names

We can also retrieve the variable names of any data frame by passing it to names().

names(animals_data)
## [1] "animals"          "animal_lifespans"
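names() can also be used on the left-hand side of an assignment to rename a variable. The sketch below renames the second column of a small stand-in data frame. The new name lifespan_years is an arbitrary choice for illustration.

```r
# A three-row stand-in for animals_data
small_data <- data.frame(
  animals          = c("dog", "mouse", "human"),
  animal_lifespans = c(24, 4, 122.5)
)

# Replace the second variable name (hypothetical new name)
names(small_data)[2] <- "lifespan_years"

names(small_data)
```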

Retrieve variables

If we want to retrieve specific variables from a data frame we can do that via the $ operator.

\[\color{red}{\text{dataset}}$\color{orange}{\text{variable_name}}\]

Think of the $ symbol as a door opener that helps you check what is inside an object.

animals_data$animal_lifespans
## [1] 392.0  24.0 177.0   4.0   0.3  27.0  27.0  77.0 122.5
animals_data$animals
## [1] "greenland_shark"    "dog"                "galapagos_tortoise"
## [4] "mouse"              "fly"                "lion"              
## [7] "boar"               "alligator"          "human"
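Because $ hands back an ordinary vector, we can pass the result straight to other functions. A minimal sketch on a shortened stand-in data frame (small_data is a made-up name):

```r
# A three-row stand-in for animals_data
small_data <- data.frame(
  animals          = c("dog", "mouse", "human"),
  animal_lifespans = c(24, 4, 122.5)
)

# Longest lifespan in the data
longest <- max(small_data$animal_lifespans)

# Name of the animal with the longest lifespan
oldest_animal <- small_data$animals[which.max(small_data$animal_lifespans)]

longest
oldest_animal
```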

(Re-)Code variables

We can also use the $ data access to add new variables.

In the example below we create a variable called animal_to_human, which holds each animal's lifespan divided by the human lifespan.

We do that by simply assigning a vector containing that information to animals_data$animal_to_human even if that variable doesn’t exist yet.

animals_data$animal_to_human <- animals_data$animal_lifespans / human_lifespan
animals_data
##              animals animal_lifespans animal_to_human
## 1    greenland_shark            392.0      3.20000000
## 2                dog             24.0      0.19591837
## 3 galapagos_tortoise            177.0      1.44489796
## 4              mouse              4.0      0.03265306
## 5                fly              0.3      0.00244898
## 6               lion             27.0      0.22040816
## 7               boar             27.0      0.22040816
## 8          alligator             77.0      0.62857143
## 9              human            122.5      1.00000000
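The same pattern works for any calculation on an existing column. As a sketch, the following adds a lifespan in months to a small stand-in data frame. The variable name lifespan_months is hypothetical.

```r
# A three-row stand-in for animals_data
small_data <- data.frame(
  animals          = c("dog", "mouse", "human"),
  animal_lifespans = c(24, 4, 122.5)
)

# Create a new variable by assigning to a name that does not exist yet
small_data$lifespan_months <- small_data$animal_lifespans * 12

small_data
```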

Indexing

Just as we did before with vectors we can also index data frames with square brackets: []. However, unlike vectors, data frames have two dimensions.

So that is why the square brackets in this case take two inputs, separated by a comma:

\[\color{red}{\text{dataset}}[\color{orange}{\text{rows}},\color{lightblue}{\text{columns}}]\]

  • The first value after the opening square bracket refers to \(\color{orange}{\text{which rows}}\) you want to keep.

  • The second value refers to \(\color{lightblue}{\text{which columns}}\) you want to keep.

So if we only want to keep the value in the first row of the first column of animals_data, we do it like this:

animals_data[1, 1]
## [1] "greenland_shark"

If we want to keep a certain row but all columns we can do this by leaving the second value within the square brackets empty.

animals_data[1, ]
##           animals animal_lifespans animal_to_human
## 1 greenland_shark              392             3.2

The same works for columns: leave the first value empty to keep all rows.

For a single column this actually returns a vector:

animals_data[, 1]
## [1] "greenland_shark"    "dog"                "galapagos_tortoise"
## [4] "mouse"              "fly"                "lion"              
## [7] "boar"               "alligator"          "human"
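A few more indexing variations, sketched on a small stand-in data frame. Rows and columns can be given as vectors of positions or as column names, and the base R argument drop = FALSE keeps a single column as a data frame instead of a vector.

```r
# A three-row stand-in for animals_data
small_data <- data.frame(
  animals          = c("dog", "mouse", "human"),
  animal_lifespans = c(24, 4, 122.5)
)

# Rows 1 and 3, all columns
first_and_third <- small_data[c(1, 3), ]

# One column selected by name: returns a vector
as_vector <- small_data[, "animals"]

# Same column, but kept as a one-column data frame
as_data_frame <- small_data[, "animals", drop = FALSE]
```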

Indexing with logical tests

We can also do more complex indexing by keeping only the rows that fulfill a certain condition. Let’s say we only want to keep the rows that contain animals that have longer lifespans than humans.

animals_data$animal_lifespans > human_lifespan
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
animals_data[animals_data$animal_lifespans > human_lifespan, ]
##              animals animal_lifespans animal_to_human
## 1    greenland_shark              392        3.200000
## 3 galapagos_tortoise              177        1.444898
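Logical tests can be combined with & (and) or | (or). The sketch below keeps animals that live longer than 20 but less than 200 years. The thresholds are arbitrary and only serve as an illustration.

```r
# A four-row stand-in for animals_data
small_data <- data.frame(
  animals          = c("greenland_shark", "dog", "galapagos_tortoise", "mouse"),
  animal_lifespans = c(392, 24, 177, 4)
)

# Both conditions must be TRUE for a row to be kept
keep <- small_data$animal_lifespans > 20 & small_data$animal_lifespans < 200

mid_range <- small_data[keep, ]
mid_range
```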

R Packages

Some amazing R packages: the `easystats` library

Packages are at the heart of R:

  • R packages are basically a collection of functions that you load into your working environment.

  • They contain code that other R users have prepared for the community.

  • It’s good to know your packages; they can really make your life easier.

  • I suggest keeping track of package developments on Twitter via #rstats.

You can install packages in R using the install.packages function:

install.packages("janitor")

However, installing is not enough. You also need to load the package via library.

library(janitor)
## Warning: package 'janitor' was built under R version 4.1.3
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Think of install.packages as buying a set of tools (for free!) and library as pulling out the tools each time you want to work with them.
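To check whether a set of tools has already been bought, base R offers requireNamespace(). The sketch below wraps it in a small helper. The helper name is_installed is made up for this example, and the install line is left commented out so that running the sketch never triggers a download.

```r
# Hypothetical helper: TRUE if the package can be found on this machine
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# The stats package ships with R, so this is TRUE
is_installed("stats")

# Install janitor only when it is missing (uncomment to use)
# if (!is_installed("janitor")) install.packages("janitor")
```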

The Tidyverse

What is the tidyverse?

The tidyverse describes itself as follows:

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Core principle: tidy data

  • Every column is a variable.
  • Every row is an observation.
  • Every cell is a single value.

We have already seen tidy data:

Animal               Maximum Lifespan   Animal/Human Years Ratio
Domestic dog         24.0                5.10
Domestic cat         30.0                4.08
American alligator   77.0                1.59
Golden hamster        3.9               31.41
King penguin         26.0                4.71
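The ratio column can be reproduced from the lifespan column. The sketch below assumes, judging from the numbers, that the ratio is the human maximum lifespan (122.5 years) divided by the animal's maximum lifespan, rounded to two decimals. The object name tidy_example is made up.

```r
human_lifespan <- 122.5

# Every row is one animal, every column one variable
tidy_example <- data.frame(
  animal   = c("Domestic dog", "Domestic cat", "American alligator",
               "Golden hamster", "King penguin"),
  lifespan = c(24.0, 30.0, 77.0, 3.9, 26.0)
)

# Assumed definition: human years per animal year
tidy_example$ratio <- round(human_lifespan / tidy_example$lifespan, 2)

tidy_example
```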

Untidy data I

Animal               Type       Value
Domestic dog         lifespan   24.0
Domestic dog         ratio       5.10
Domestic cat         lifespan   30.0
Domestic cat         ratio       4.08
American alligator   lifespan   77.0
American alligator   ratio       1.59
Golden hamster       lifespan    3.9
Golden hamster       ratio      31.41
King penguin         lifespan   26.0
King penguin         ratio       4.71

The data above has multiple rows with the same observation (animal).

= not tidy

Untidy data II

Animal               Lifespan/Ratio
Domestic dog         24.0 / 5.10
Domestic cat         30.0 / 4.08
American alligator   77.0 / 1.59
Golden hamster       3.9 / 31.41
King penguin         26.0 / 4.71

The data above has multiple variables per column.

= not tidy

Core principle: tidy data

Artist: Allison Horst

Tidy data has two decisive advantages:

  • Consistently prepared data is easier to read, process, load and save.

  • Many procedures (or the associated functions) in R require this type of data.

Artist: Allison Horst

Installing and loading the tidyverse

First we install the packages of the tidyverse like this. In Google Colab we actually don’t need to install the tidyverse because it comes pre-installed!

install.packages("tidyverse")

Then we load them:

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.7     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'dplyr' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

A new data set appears...

We are going to work with a new data set from here on out.

No worries, we will stay within the animal kingdom but we need a data set that is a little more complex than what we have seen already.

Meet the Palmer Station penguins!

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER.

Artist: Allison Horst

Palmer Penguins

We could install the R package palmerpenguins and then access the data.

However, we are going to use a different method: directly load a .csv file (comma-separated values) into R from the internet.

We can use the readr package which provides many convenient functions to load data into R. Here we need read_csv:

penguins_raw <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins_raw.csv")
## Warning in gzfile(file, mode): cannot open compressed file 'C:/Users/favoo/
## AppData/Local/Temp/Rtmpm6bE9l\file7288d5d4f67', probable reason 'No such file or
## directory'
## 
## -- Column specification --------------------------------------------------------
## cols(
##   studyName = col_character(),
##   `Sample Number` = col_double(),
##   Species = col_character(),
##   Region = col_character(),
##   Island = col_character(),
##   Stage = col_character(),
##   `Individual ID` = col_character(),
##   `Clutch Completion` = col_character(),
##   `Date Egg` = col_date(format = ""),
##   `Culmen Length (mm)` = col_double(),
##   `Culmen Depth (mm)` = col_double(),
##   `Flipper Length (mm)` = col_double(),
##   `Body Mass (g)` = col_double(),
##   Sex = col_character(),
##   `Delta 15 N (o/oo)` = col_double(),
##   `Delta 13 C (o/oo)` = col_double(),
##   Comments = col_character()
## )
penguins_raw
## # A tibble: 344 x 17
##    studyName `Sample Number` Species         Region Island Stage `Individual ID`
##    <chr>               <dbl> <chr>           <chr>  <chr>  <chr> <chr>          
##  1 PAL0708                 1 Adelie Penguin~ Anvers Torge~ Adul~ N1A1           
##  2 PAL0708                 2 Adelie Penguin~ Anvers Torge~ Adul~ N1A2           
##  3 PAL0708                 3 Adelie Penguin~ Anvers Torge~ Adul~ N2A1           
##  4 PAL0708                 4 Adelie Penguin~ Anvers Torge~ Adul~ N2A2           
##  5 PAL0708                 5 Adelie Penguin~ Anvers Torge~ Adul~ N3A1           
##  6 PAL0708                 6 Adelie Penguin~ Anvers Torge~ Adul~ N3A2           
##  7 PAL0708                 7 Adelie Penguin~ Anvers Torge~ Adul~ N4A1           
##  8 PAL0708                 8 Adelie Penguin~ Anvers Torge~ Adul~ N4A2           
##  9 PAL0708                 9 Adelie Penguin~ Anvers Torge~ Adul~ N5A1           
## 10 PAL0708                10 Adelie Penguin~ Anvers Torge~ Adul~ N5A2           
## # ... with 334 more rows, and 10 more variables: `Clutch Completion` <chr>,
## #   `Date Egg` <date>, `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
## #   `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
## #   `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>
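read_csv() reads local files in the same way as URLs. As a self-contained sketch, the following writes a tiny .csv file to a temporary location and reads it back. The file contents and the object names are invented for the example.

```r
library(readr)

# Write a small comma-separated file to a temporary path
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Animal,Lifespan",
    "Domestic dog,24",
    "Golden hamster,3.9"),
  csv_path
)

# Read it back; the first line is used for the column names
animals_csv <- read_csv(csv_path)
animals_csv
```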

Take a glimpse

We can also take a look at the data set using the glimpse function from dplyr.

glimpse(penguins_raw)
## Rows: 344
## Columns: 17
## $ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL~
## $ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1~
## $ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P~
## $ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"~
## $ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse~
## $ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu~
## $ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", ~
## $ `Clutch Completion`   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", ~
## $ `Date Egg`            <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16,~
## $ `Culmen Length (mm)`  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34~
## $ `Culmen Depth (mm)`   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18~
## $ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190,~
## $ `Body Mass (g)`       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 34~
## $ Sex                   <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"~
## $ `Delta 15 N (o/oo)`   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18~
## $ `Delta 13 C (o/oo)`   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.298~
## $ Comments              <chr> "Not enough blood for isotopes.", NA, NA, "Adult~

Initial data cleaning using janitor

janitor is not officially part of the tidyverse, but in my view it is incredibly important to know.

It provides some convenient functions for basic data cleaning.

Just like any tidyverse-style package, its functions fulfill the following criterion:

The data is always the first argument.

This lets us pass the data by position, without naming the argument.

install.packages("janitor")
library(janitor)

clean_names()

One annoyance with the penguins_raw data is that it has spaces in the variable names. Urgh!

We have to wrap variable names that contain spaces in backticks:

penguins_raw$`Delta 15 N (o/oo)`
##   [1]       NA  8.94956  8.36821       NA  8.76651  8.66496  9.18718  9.46060
##   [9]       NA  9.13362  8.63243       NA       NA       NA  8.55583       NA
##  [17]  9.18528  8.67538  8.47827  9.11616  8.73762  8.66271  9.22286  8.43423
##  [25]  9.63954  9.21292  8.93997  8.08138  8.38404  8.90027  9.69756  9.72764
##  [33]  9.66523  8.79665  9.17847  9.15308  9.18985  8.04787  9.41131       NA
##  [41]  9.68933       NA  9.50772  9.23720  9.36392  9.49106       NA       NA
##  [49]  9.51784  8.87988  8.46616  8.51362  8.19539  8.48095  8.41837  8.35396
##  [57]  8.57199  8.56674  9.07878  9.10800  8.96472  8.74802  8.58063  8.62264
##  [65]  8.62623  8.85562  8.56192  8.71078  8.47781  8.86853  7.88863  9.29808
##  [73]  8.33524  8.18658  8.70642  8.29930  8.47257  8.35540  7.82381  9.05736
##  [81]  7.69778  8.63259  7.88494  8.90002  8.32718  9.14863  8.57087  8.59147
##  [89]  9.07826  8.36936  8.46531  8.77018  8.01485  8.49915  8.90723  8.48204
##  [97]  8.10277  8.39459  9.04218  8.97025  8.84451  9.01079  9.21510  9.51929
## [105]  9.02642  8.85699  8.77322  9.59245  9.79532  9.31735  8.43951  8.65466
## [113]  9.02657  8.80186  8.80967  8.91434  9.18021  9.49645  8.96436  9.32277
## [121]  9.04296  9.11066  9.30722  9.59462  8.81668  9.22537  8.88098  8.52566
## [129]  9.19031  9.10702  8.98460  8.86495  8.98705  8.56708  8.71700  8.94365
## [137]  8.75984  8.95998  8.61651  9.25769  9.28810  9.23408  8.79787  9.05674
## [145]  9.06829  9.22033  9.11006  8.68744  8.94332  8.97533  8.93465  8.89640
## [153]  7.99300  8.14756  8.14705  8.25540  8.23450  7.99530  8.24515  8.22673
## [161]  8.13643  8.16310  8.19579  8.10417  7.77672  7.82080  7.79958  8.07137
## [169]  7.63884  8.27376  7.84057  7.96491  7.89620  7.63220  7.90436  7.90971
## [177]  7.68528  7.83733  7.96621  7.92358  7.68870  8.30515       NA  7.63452
## [185]  7.97408  7.76843  7.89744  8.03659  7.96935  8.13746  8.01979  8.14776
## [193]  8.14567  8.38324  8.37615  8.26548  8.46894  8.27141  8.47829  8.65803
## [201]  8.45167  8.55868  8.38289  8.39867  8.51951  8.50153  8.48789  8.63488
## [209]  8.58319  8.63604  8.48367  8.74647  8.65015  8.60092  8.62870  8.49662
## [217]  8.60447  8.47067  8.24253  8.49854  8.64931  8.63551  8.53018  8.35078
## [225]  8.24651  8.58487  8.47938  8.59640  8.39299  8.40327  8.24694  8.19749
## [233]  8.35802  8.28601  8.19101  8.20042  8.11238  8.27428  8.23468  8.15426
## [241]  8.12691  8.27595  8.29671  8.36701  8.15566  8.83352  8.20106  8.27102
## [249]  8.03624  7.88810  8.16582  8.20660  8.10231  8.31180  8.30817  8.65914
## [257]  8.25818  8.32359  8.12311  8.41017  8.42070  8.45738  8.24691  8.29226
## [265]  8.21634  8.78557  8.30231  8.08354  8.04111  8.33825  7.99184       NA
## [273]  8.41151  8.30166  8.24246  8.36390  9.03935  8.92069  9.29078  8.64701
## [281]  9.00642  8.88942  8.85664  8.63701  8.47173  8.79581  8.95063  8.68747
## [289]  8.72037  9.02330  9.12277  9.80590 10.02019  9.14382  9.32105  9.27158
## [297]  9.35138  9.42666  9.35416  9.28153  9.74144  9.36799  8.93990  9.63074
## [305]  9.37369  9.25177  9.08458  9.49283  9.36668  9.23196  9.75486  9.07825
## [313]  8.83502  9.43146  9.80589 10.02544  9.53262  9.61734 10.02372  9.36493
## [321]  9.43684  9.45827  9.46819  9.34089  9.68950  9.32169  9.46929  9.43782
## [329]  9.41500  9.93727  9.56534  9.77528  9.62357  9.88809  9.74492  9.46985
## [337]       NA  9.65061  9.26715  9.70465  9.37608  9.46180  9.98044  9.39305
penguins_raw$`Flipper Length (mm)`
##   [1] 181 186 195  NA 193 190 181 195 193 190 186 180 182 191 198 185 195 197
##  [19] 184 194 174 180 189 185 180 187 183 187 172 180 178 178 188 184 195 196
##  [37] 190 180 181 184 182 195 186 196 185 190 182 179 190 191 186 188 190 200
##  [55] 187 191 186 193 181 194 185 195 185 192 184 192 195 188 190 198 190 190
##  [73] 196 197 190 195 191 184 187 195 189 196 187 193 191 194 190 189 189 190
##  [91] 202 205 185 186 187 208 190 196 178 192 192 203 183 190 193 184 199 190
## [109] 181 197 198 191 193 197 191 196 188 199 189 189 187 198 176 202 186 199
## [127] 191 195 191 210 190 197 193 199 187 190 191 200 185 193 193 187 188 190
## [145] 192 185 190 184 195 193 187 201 211 230 210 218 215 210 211 219 209 215
## [163] 214 216 214 213 210 217 210 221 209 222 218 215 213 215 215 215 216 215
## [181] 210 220 222 209 207 230 220 220 213 219 208 208 208 225 210 216 222 217
## [199] 210 225 213 215 210 220 210 225 217 220 208 220 208 224 208 221 214 231
## [217] 219 230 214 229 220 223 216 221 221 217 216 230 209 220 215 223 212 221
## [235] 212 224 212 228 218 218 212 230 218 228 212 224 214 226 216 222 203 225
## [253] 219 228 215 228 216 215 210 219 208 209 216 229 213 230 217 230 217 222
## [271] 214  NA 215 222 212 213 192 196 193 188 197 198 178 197 195 198 193 194
## [289] 185 201 190 201 197 181 190 195 181 191 187 193 195 197 200 200 191 205
## [307] 187 201 187 203 195 199 195 210 192 205 210 187 196 196 196 201 190 212
## [325] 187 198 199 201 193 203 187 197 191 203 202 194 206 189 195 207 202 193
## [343] 210 198

janitor can help with that, using a function called clean_names().

clean_names() just magically turns all our messy column names into readable lower-case snake case:

penguins_clean <- clean_names(penguins_raw)

This is what the variables look like now:

glimpse(penguins_clean)
## Rows: 344
## Columns: 17
## $ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708~
## $ sample_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu~
## $ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A~
## $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", ~
## $ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, ~
## $ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A~
## $ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"~
## $ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200~
## $ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F~
## $ delta_15_n_o_oo   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,~
## $ delta_13_c_o_oo   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, ~
## $ comments          <chr> "Not enough blood for isotopes.", NA, NA, "Adult not~
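To see the renaming in isolation, here is a minimal sketch on a one-row data frame that borrows three of the original column names. The base R argument check.names = FALSE is needed so that data.frame() keeps the spaces and brackets. The object name messy is made up.

```r
library(janitor)

# One-row stand-in with three of the messy names from penguins_raw
messy <- data.frame(
  studyName       = "PAL0708",
  `Sample Number` = 1,
  `Body Mass (g)` = 3750,
  check.names = FALSE
)

cleaned <- clean_names(messy)

# Compare the names before and after
names(messy)
names(cleaned)
```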

remove_constant()

Now we have another problem. Not all variables in the penguins_clean data set are that useful.

Some of them are the same across all observations. We don’t need those variables, like region.

table(penguins_clean$region)
## 
## Anvers 
##    344

We can use the base R function table to quickly get some tabulations of our variable.

Here to help get rid of these constant columns is the function remove_constant().

penguins_clean <- remove_constant(penguins_clean, quiet = FALSE)
## Removing 2 constant columns of 17 columns total (Removed: region, stage).

When we set quiet = FALSE we even get some info about what exactly was removed. Neat!

Another useful function in janitor is remove_empty(), which removes all rows or columns that consist only of missing values (i.e. NA).
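A minimal sketch of remove_empty() on invented data: the notes column contains only NA and is dropped, while mass has just one missing value and is kept. The which argument chooses between "rows" and "cols".

```r
library(janitor)

# Invented example data: `notes` is entirely missing
with_empty <- data.frame(
  id    = 1:3,
  notes = c(NA, NA, NA),
  mass  = c(3750, NA, 3250)
)

# Drop columns that consist only of NA
without_empty <- remove_empty(with_empty, which = "cols")

names(without_empty)
```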

Data cleaning using tidyr

We are already fairly advanced in our tidying.

But our data set is still not entirely tidy.

Consider the species variable:

table(penguins_clean$species)
## 
##       Adelie Penguin (Pygoscelis adeliae) 
##                                       152 
## Chinstrap penguin (Pygoscelis antarctica) 
##                                        68 
##         Gentoo penguin (Pygoscelis papua) 
##                                       124

This variable violates the tidy rule that each cell should contain a single value.

species holds both the common name and the Latin name of the penguin.

separate()

We can use a tidyr function called separate() to turn this into two variables.

Two arguments are important for that:

  • sep: specifies by which character the value should be split
  • into: a vector which specifies the resulting new variable names

In our case we want to split at a space followed by an opening parenthesis (written as the regular expression " \\(") and name the resulting variables species and latin_name:

penguins_clean <- separate(penguins_clean, species, sep = " \\(", into = c("species", "latin_name"))
penguins_clean
## # A tibble: 344 x 16
##    study_name sample_number species        latin_name       island individual_id
##    <chr>              <dbl> <chr>          <chr>            <chr>  <chr>        
##  1 PAL0708                1 Adelie Penguin Pygoscelis adel~ Torge~ N1A1         
##  2 PAL0708                2 Adelie Penguin Pygoscelis adel~ Torge~ N1A2         
##  3 PAL0708                3 Adelie Penguin Pygoscelis adel~ Torge~ N2A1         
##  4 PAL0708                4 Adelie Penguin Pygoscelis adel~ Torge~ N2A2         
##  5 PAL0708                5 Adelie Penguin Pygoscelis adel~ Torge~ N3A1         
##  6 PAL0708                6 Adelie Penguin Pygoscelis adel~ Torge~ N3A2         
##  7 PAL0708                7 Adelie Penguin Pygoscelis adel~ Torge~ N4A1         
##  8 PAL0708                8 Adelie Penguin Pygoscelis adel~ Torge~ N4A2         
##  9 PAL0708                9 Adelie Penguin Pygoscelis adel~ Torge~ N5A1         
## 10 PAL0708               10 Adelie Penguin Pygoscelis adel~ Torge~ N5A2         
## # ... with 334 more rows, and 10 more variables: clutch_completion <chr>,
## #   date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## #   flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## #   delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>

Now there is still a trailing ) at the end of latin_name. We can remove that using the stringr package and more specifically the str_remove() function.

penguins_clean$latin_name <- str_remove(penguins_clean$latin_name, "\\)")
penguins_clean
## # A tibble: 344 x 16
##    study_name sample_number species        latin_name       island individual_id
##    <chr>              <dbl> <chr>          <chr>            <chr>  <chr>        
##  1 PAL0708                1 Adelie Penguin Pygoscelis adel~ Torge~ N1A1         
##  2 PAL0708                2 Adelie Penguin Pygoscelis adel~ Torge~ N1A2         
##  3 PAL0708                3 Adelie Penguin Pygoscelis adel~ Torge~ N2A1         
##  4 PAL0708                4 Adelie Penguin Pygoscelis adel~ Torge~ N2A2         
##  5 PAL0708                5 Adelie Penguin Pygoscelis adel~ Torge~ N3A1         
##  6 PAL0708                6 Adelie Penguin Pygoscelis adel~ Torge~ N3A2         
##  7 PAL0708                7 Adelie Penguin Pygoscelis adel~ Torge~ N4A1         
##  8 PAL0708                8 Adelie Penguin Pygoscelis adel~ Torge~ N4A2         
##  9 PAL0708                9 Adelie Penguin Pygoscelis adel~ Torge~ N5A1         
## 10 PAL0708               10 Adelie Penguin Pygoscelis adel~ Torge~ N5A2         
## # ... with 334 more rows, and 10 more variables: clutch_completion <chr>,
## #   date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## #   flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## #   delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>

There is also a function called unite() which works in the opposite direction.
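As a sketch of that opposite direction, the following pastes two columns back into one with unite(). The new column name full_name and the separator " / " are arbitrary choices for illustration.

```r
library(tidyr)

# Two-row stand-in with the separated columns
split_names <- data.frame(
  species    = c("Adelie Penguin", "Gentoo penguin"),
  latin_name = c("Pygoscelis adeliae", "Pygoscelis papua")
)

# Combine both columns into a single one called full_name
united <- unite(split_names, "full_name", species, latin_name, sep = " / ")

united$full_name
```

By default unite() drops the input columns, so united only contains full_name.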

Now our data is in tidy format!

We were in luck because the data pretty much already came in a format that was: 1 observation per row.

But what if that is not the case?

pivot_wider() and pivot_longer()

tidyr also comes equipped to deal with data in which a single observation is spread across several rows.

The function to use here is called pivot_wider.

Now our penguins_clean data is already tidy.

But we can just read in a dataset that isn’t:

untidy_animals <- read_csv("https://github.com/favstats/ds3_r_intro/blob/main/data/untidy_animals.csv?raw=true")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   Animal = col_character(),
##   Type = col_character(),
##   Value = col_double()
## )
untidy_animals
## # A tibble: 10 x 3
##    Animal             Type     Value
##    <chr>              <chr>    <dbl>
##  1 Domestic dog       lifespan 24   
##  2 Domestic dog       ratio     5.1 
##  3 Domestic cat       lifespan 30   
##  4 Domestic cat       ratio     4.08
##  5 American alligator lifespan 77   
##  6 American alligator ratio     1.59
##  7 Golden hamster     lifespan  3.9 
##  8 Golden hamster     ratio    31.4 
##  9 King penguin       lifespan 26   
## 10 King penguin       ratio     4.71

You may recognize this data from the subsection Untidy data I.

Now let’s use pivot_wider to make every row an observation.

We need two main arguments for that:

  1. names_from: tells the function where the new column names come from
  2. values_from: tells the function where the values should come from

tidy_animals <- pivot_wider(untidy_animals, names_from = Type, values_from = Value)
tidy_animals
## # A tibble: 5 x 3
##   Animal             lifespan ratio
##   <chr>                 <dbl> <dbl>
## 1 Domestic dog           24    5.1 
## 2 Domestic cat           30    4.08
## 3 American alligator     77    1.59
## 4 Golden hamster          3.9 31.4 
## 5 King penguin           26    4.71

pivot_longer can untidy our data again

The argument cols = tells the function which variables to turn into long format:

pivot_longer(tidy_animals, cols = c(lifespan, ratio))
## # A tibble: 10 x 3
##    Animal             name     value
##    <chr>              <chr>    <dbl>
##  1 Domestic dog       lifespan 24   
##  2 Domestic dog       ratio     5.1 
##  3 Domestic cat       lifespan 30   
##  4 Domestic cat       ratio     4.08
##  5 American alligator lifespan 77   
##  6 American alligator ratio     1.59
##  7 Golden hamster     lifespan  3.9 
##  8 Golden hamster     ratio    31.4 
##  9 King penguin       lifespan 26   
## 10 King penguin       ratio     4.71
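By default the new columns are called name and value. A minimal sketch, on a two-row stand-in data frame, of how the names_to and values_to arguments can restore column names such as Type and Value:

```r
library(tidyr)

# Two-row stand-in for tidy_animals
tidy_small <- data.frame(
  Animal   = c("Domestic dog", "Domestic cat"),
  lifespan = c(24, 30),
  ratio    = c(5.10, 4.08)
)

# Choose the names of the two new columns explicitly
long_small <- pivot_longer(
  tidy_small,
  cols      = c(lifespan, ratio),
  names_to  = "Type",
  values_to = "Value"
)

long_small
```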

Data manipulation using dplyr

Artist: Allison Horst

select()

select() is part of the dplyr package and helps you select variables.

Remember: with tidyverse-style functions, data is always the first argument.

Select variables

Here we only keep individual_id, sex and species.

select(penguins_clean, individual_id, sex, species)
## # A tibble: 344 x 3
##    individual_id sex    species       
##    <chr>         <chr>  <chr>         
##  1 N1A1          MALE   Adelie Penguin
##  2 N1A2          FEMALE Adelie Penguin
##  3 N2A1          FEMALE Adelie Penguin
##  4 N2A2          <NA>   Adelie Penguin
##  5 N3A1          FEMALE Adelie Penguin
##  6 N3A2          MALE   Adelie Penguin
##  7 N4A1          FEMALE Adelie Penguin
##  8 N4A2          MALE   Adelie Penguin
##  9 N5A1          <NA>   Adelie Penguin
## 10 N5A2          <NA>   Adelie Penguin
## # ... with 334 more rows

But select() is more powerful than that.

Remove variables

We can also remove variables with a - (minus).

Here we remove individual_id, sex and species.

select(penguins_clean, -individual_id, -sex, -species)
## # A tibble: 344 x 13
##    study_name sample_number latin_name        island clutch_completi~ date_egg  
##    <chr>              <dbl> <chr>             <chr>  <chr>            <date>    
##  1 PAL0708                1 Pygoscelis adeli~ Torge~ Yes              2007-11-11
##  2 PAL0708                2 Pygoscelis adeli~ Torge~ Yes              2007-11-11
##  3 PAL0708                3 Pygoscelis adeli~ Torge~ Yes              2007-11-16
##  4 PAL0708                4 Pygoscelis adeli~ Torge~ Yes              2007-11-16
##  5 PAL0708                5 Pygoscelis adeli~ Torge~ Yes              2007-11-16
##  6 PAL0708                6 Pygoscelis adeli~ Torge~ Yes              2007-11-16
##  7 PAL0708                7 Pygoscelis adeli~ Torge~ No               2007-11-15
##  8 PAL0708                8 Pygoscelis adeli~ Torge~ No               2007-11-15
##  9 PAL0708                9 Pygoscelis adeli~ Torge~ Yes              2007-11-09
## 10 PAL0708               10 Pygoscelis adeli~ Torge~ Yes              2007-11-09
## # ... with 334 more rows, and 7 more variables: culmen_length_mm <dbl>,
## #   culmen_depth_mm <dbl>, flipper_length_mm <dbl>, body_mass_g <dbl>,
## #   delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>

Selection helpers

These selection helpers match variables according to a given pattern.

starts_with(): Starts with a prefix.

ends_with(): Ends with a suffix.

contains(): Contains a literal string.

matches(): Matches a regular expression.

For example: let’s keep all variables that start with s:

select(penguins_clean, starts_with("s"))
## # A tibble: 344 x 4
##    study_name sample_number species        sex   
##    <chr>              <dbl> <chr>          <chr> 
##  1 PAL0708                1 Adelie Penguin MALE  
##  2 PAL0708                2 Adelie Penguin FEMALE
##  3 PAL0708                3 Adelie Penguin FEMALE
##  4 PAL0708                4 Adelie Penguin <NA>  
##  5 PAL0708                5 Adelie Penguin FEMALE
##  6 PAL0708                6 Adelie Penguin MALE  
##  7 PAL0708                7 Adelie Penguin FEMALE
##  8 PAL0708                8 Adelie Penguin MALE  
##  9 PAL0708                9 Adelie Penguin <NA>  
## 10 PAL0708               10 Adelie Penguin <NA>  
## # ... with 334 more rows
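The other helpers work the same way. The following is a minimal sketch on a small made-up tibble called toy (an assumption for illustration, not the workshop data) that borrows a few column names from penguins_clean:

```r
library(dplyr)

# Toy stand-in for penguins_clean (made-up values, same naming style)
toy <- tibble(
  study_name        = c("PAL0708", "PAL0809"),
  culmen_length_mm  = c(39.1, 46.5),
  culmen_depth_mm   = c(18.7, 17.9),
  flipper_length_mm = c(181, 192),
  body_mass_g       = c(3750, 3500)
)

# ends_with(): every variable whose name ends in "_mm"
mm_vars <- select(toy, ends_with("_mm"))

# contains(): every variable with "length" somewhere in its name
length_vars <- select(toy, contains("length"))

# matches(): regular expression, here names starting with "culmen" or "body"
regex_vars <- select(toy, matches("^(culmen|body)"))

names(mm_vars)      # culmen_length_mm, culmen_depth_mm, flipper_length_mm
names(length_vars)  # culmen_length_mm, flipper_length_mm
names(regex_vars)   # culmen_length_mm, culmen_depth_mm, body_mass_g
```

The helpers only look at variable names, never at the values stored in the variables.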

Even more ways to select

Select the first 5 variables:

select(penguins_clean, 1:5)
## # A tibble: 344 x 5
##    study_name sample_number species        latin_name         island   
##    <chr>              <dbl> <chr>          <chr>              <chr>    
##  1 PAL0708                1 Adelie Penguin Pygoscelis adeliae Torgersen
##  2 PAL0708                2 Adelie Penguin Pygoscelis adeliae Torgersen
##  3 PAL0708                3 Adelie Penguin Pygoscelis adeliae Torgersen
##  4 PAL0708                4 Adelie Penguin Pygoscelis adeliae Torgersen
##  5 PAL0708                5 Adelie Penguin Pygoscelis adeliae Torgersen
##  6 PAL0708                6 Adelie Penguin Pygoscelis adeliae Torgersen
##  7 PAL0708                7 Adelie Penguin Pygoscelis adeliae Torgersen
##  8 PAL0708                8 Adelie Penguin Pygoscelis adeliae Torgersen
##  9 PAL0708                9 Adelie Penguin Pygoscelis adeliae Torgersen
## 10 PAL0708               10 Adelie Penguin Pygoscelis adeliae Torgersen
## # ... with 334 more rows

Select everything from individual_id to flipper_length_mm.

select(penguins_clean, individual_id:flipper_length_mm)
## # A tibble: 344 x 6
##    individual_id clutch_completion date_egg   culmen_length_mm culmen_depth_mm
##    <chr>         <chr>             <date>                <dbl>           <dbl>
##  1 N1A1          Yes               2007-11-11             39.1            18.7
##  2 N1A2          Yes               2007-11-11             39.5            17.4
##  3 N2A1          Yes               2007-11-16             40.3            18  
##  4 N2A2          Yes               2007-11-16             NA              NA  
##  5 N3A1          Yes               2007-11-16             36.7            19.3
##  6 N3A2          Yes               2007-11-16             39.3            20.6
##  7 N4A1          No                2007-11-15             38.9            17.8
##  8 N4A2          No                2007-11-15             39.2            19.6
##  9 N5A1          Yes               2007-11-09             34.1            18.1
## 10 N5A2          Yes               2007-11-09             42              20.2
## # ... with 334 more rows, and 1 more variable: flipper_length_mm <dbl>

filter()

helps you filter rows

Here we keep only the penguins from Dream island.

filter(penguins_clean, island == "Dream")
## # A tibble: 124 x 16
##    study_name sample_number species        latin_name       island individual_id
##    <chr>              <dbl> <chr>          <chr>            <chr>  <chr>        
##  1 PAL0708               31 Adelie Penguin Pygoscelis adel~ Dream  N21A1        
##  2 PAL0708               32 Adelie Penguin Pygoscelis adel~ Dream  N21A2        
##  3 PAL0708               33 Adelie Penguin Pygoscelis adel~ Dream  N22A1        
##  4 PAL0708               34 Adelie Penguin Pygoscelis adel~ Dream  N22A2        
##  5 PAL0708               35 Adelie Penguin Pygoscelis adel~ Dream  N23A1        
##  6 PAL0708               36 Adelie Penguin Pygoscelis adel~ Dream  N23A2        
##  7 PAL0708               37 Adelie Penguin Pygoscelis adel~ Dream  N24A1        
##  8 PAL0708               38 Adelie Penguin Pygoscelis adel~ Dream  N24A2        
##  9 PAL0708               39 Adelie Penguin Pygoscelis adel~ Dream  N25A1        
## 10 PAL0708               40 Adelie Penguin Pygoscelis adel~ Dream  N25A2        
## # ... with 114 more rows, and 10 more variables: clutch_completion <chr>,
## #   date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## #   flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## #   delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>

%in%

The %in% operator comes in handy again if we want to keep rows from more than one island:

islands_to_keep <- c("Dream", "Biscoe")

filter(penguins_clean, island %in% islands_to_keep)
## # A tibble: 292 x 16
##    study_name sample_number species        latin_name       island individual_id
##    <chr>              <dbl> <chr>          <chr>            <chr>  <chr>        
##  1 PAL0708               21 Adelie Penguin Pygoscelis adel~ Biscoe N11A1        
##  2 PAL0708               22 Adelie Penguin Pygoscelis adel~ Biscoe N11A2        
##  3 PAL0708               23 Adelie Penguin Pygoscelis adel~ Biscoe N12A1        
##  4 PAL0708               24 Adelie Penguin Pygoscelis adel~ Biscoe N12A2        
##  5 PAL0708               25 Adelie Penguin Pygoscelis adel~ Biscoe N13A1        
##  6 PAL0708               26 Adelie Penguin Pygoscelis adel~ Biscoe N13A2        
##  7 PAL0708               27 Adelie Penguin Pygoscelis adel~ Biscoe N17A1        
##  8 PAL0708               28 Adelie Penguin Pygoscelis adel~ Biscoe N17A2        
##  9 PAL0708               29 Adelie Penguin Pygoscelis adel~ Biscoe N18A1        
## 10 PAL0708               30 Adelie Penguin Pygoscelis adel~ Biscoe N18A2        
## # ... with 282 more rows, and 10 more variables: clutch_completion <chr>,
## #   date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## #   flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## #   delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
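As a rough sketch of what %in% does inside filter(), the example below compares it with chained == tests on a small made-up tibble (toy and its values are assumptions for illustration). It also shows that several conditions separated by a comma must all be TRUE:

```r
library(dplyr)

# Toy data (made-up values)
toy <- tibble(
  island      = c("Dream", "Biscoe", "Torgersen", "Dream"),
  body_mass_g = c(3250, 4500, 3750, 4650)
)

islands_to_keep <- c("Dream", "Biscoe")

# %in% is shorthand for chaining == comparisons with | (or)
with_in <- filter(toy, island %in% islands_to_keep)
with_or <- filter(toy, island == "Dream" | island == "Biscoe")

# Both versions keep the same three rows
with_in
with_or

# Conditions separated by a comma are combined with "and"
heavy_dream <- filter(toy, island == "Dream", body_mass_g > 4000)
heavy_dream
```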

mutate()

helps you create variables

mutate will take a statement like this:

variable_name = some_calculation

and attach variable_name at the end of the dataset.

Let’s say we want to express penguin body mass in kilograms rather than grams.

We take the variable body_mass_g and divide it by 1000.

pg_new <- mutate(penguins_clean, bodymass_kg = body_mass_g/1000)

We temporarily assign the dataset to pg_new just to check whether it worked correctly:

select(pg_new, bodymass_kg, body_mass_g)
## # A tibble: 344 x 2
##    bodymass_kg body_mass_g
##          <dbl>       <dbl>
##  1        3.75        3750
##  2        3.8         3800
##  3        3.25        3250
##  4       NA             NA
##  5        3.45        3450
##  6        3.65        3650
##  7        3.62        3625
##  8        4.68        4675
##  9        3.48        3475
## 10        4.25        4250
## # ... with 334 more rows
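A single mutate() call can also create more than one variable. The sketch below uses a made-up tibble (toy, and the variable heavy with its cutoff, are hypothetical) to show that a later statement can already use a variable created earlier in the same call:

```r
library(dplyr)

# Toy data (made-up values), including one missing body mass
toy <- tibble(
  individual_id = c("N1A1", "N1A2", "N2A2"),
  body_mass_g   = c(3750, 3800, NA)
)

toy_new <- mutate(
  toy,
  bodymass_kg = body_mass_g / 1000,  # new variable, attached at the end
  heavy       = bodymass_kg > 3.76   # can already use bodymass_kg
)

toy_new
```

Missing values carry through: a row with NA in body_mass_g gets NA in both new variables.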

Recoding with ifelse

ifelse() is a very useful function that allows you to easily recode variables based on logical tests.

Its basic functionality looks like this:

\[\color{red}{\text{ifelse}}(\color{orange}{\text{logical test}},\color{blue}{\text{what should happen if TRUE}}, \color{green}{\text{what should happen if FALSE}})\]

Here is a very basic example:

ifelse(1 == 1, "Pick me if test is TRUE", "Pick me if test is FALSE")
## [1] "Pick me if test is TRUE"
ifelse(1 != 1, "Pick me if test is TRUE", "Pick me if test is FALSE")
## [1] "Pick me if test is FALSE"

Let’s use ifelse in combination with mutate.

Let’s create the variable sex_short which has a shorter label for sex:

pg_new <- mutate(penguins_clean, sex_short = ifelse(sex == "MALE", "m", "f"))

We temporarily assign the dataset to pg_new just to check whether it worked correctly:

select(pg_new, sex, sex_short)
## # A tibble: 344 x 2
##    sex    sex_short
##    <chr>  <chr>    
##  1 MALE   m        
##  2 FEMALE f        
##  3 FEMALE f        
##  4 <NA>   <NA>     
##  5 FEMALE f        
##  6 MALE   m        
##  7 FEMALE f        
##  8 MALE   m        
##  9 <NA>   <NA>     
## 10 <NA>   <NA>     
## # ... with 334 more rows
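The <NA> rows in the output above are worth a closer look. A minimal sketch on a plain vector shows why they appear, plus one possible way (an illustration, not something the workshop prescribes) to give missing values their own label:

```r
sex <- c("MALE", "FEMALE", NA)

# A missing value makes the logical test NA, so ifelse() returns NA there
short <- ifelse(sex == "MALE", "m", "f")
short  # "m" "f" NA

# Hypothetical variation: test for missing values first, then recode the rest
labelled <- ifelse(is.na(sex), "unknown", ifelse(sex == "MALE", "m", "f"))
labelled  # "m" "f" "unknown"
```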

Recoding with case_when

case_when (from the dplyr package) is like ifelse but allows for much more complex combinations.

The basic setup for a case_when call looks like this:

case_when(

\(\color{orange}{\text{logical test}}\) ~ \(\color{blue}{\text{what should happen if TRUE}}\),

\(\color{orange}{\text{logical test}}\) ~ \(\color{blue}{\text{what should happen if TRUE}}\),

\(\color{orange}{\text{logical test}}\) ~ \(\color{blue}{\text{what should happen if TRUE}}\),

\(TRUE\) ~ \(\color{green}{\text{what should happen with everything else}}\),

)

The following code recodes a numeric vector (1 through 50) into three categories:

x <- c(1:50)

x
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
case_when(
  x %in% 1:10 ~ "1 through 10",
  x %in% 11:30 ~ "11 through 30",
  TRUE ~ "above 30"
)
##  [1] "1 through 10"  "1 through 10"  "1 through 10"  "1 through 10" 
##  [5] "1 through 10"  "1 through 10"  "1 through 10"  "1 through 10" 
##  [9] "1 through 10"  "1 through 10"  "11 through 30" "11 through 30"
## [13] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
## [17] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
## [21] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
## [25] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
## [29] "11 through 30" "11 through 30" "above 30"      "above 30"     
## [33] "above 30"      "above 30"      "above 30"      "above 30"     
## [37] "above 30"      "above 30"      "above 30"      "above 30"     
## [41] "above 30"      "above 30"      "above 30"      "above 30"     
## [45] "above 30"      "above 30"      "above 30"      "above 30"     
## [49] "above 30"      "above 30"

Let’s use case_when in combination with mutate.

Let’s create the variable island_short, which has a shorter label for island:

test <- mutate(penguins_clean, 
        island_short = case_when(
          island == "Torgersen" ~ "T",
          island == "Biscoe" ~ "B",
          island == "Dream" ~ "D"
        ))
select(test, island, island_short)
## # A tibble: 344 x 2
##    island    island_short
##    <chr>     <chr>       
##  1 Torgersen T           
##  2 Torgersen T           
##  3 Torgersen T           
##  4 Torgersen T           
##  5 Torgersen T           
##  6 Torgersen T           
##  7 Torgersen T           
##  8 Torgersen T           
##  9 Torgersen T           
## 10 Torgersen T           
## # ... with 334 more rows

With case_when you can also combine several variables in the logical tests, which makes it a very powerful tool!
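As a sketch of what mixing variables can look like, the example below builds a label from island and sex at the same time. The tibble toy, the variable name group and its labels are made up for illustration:

```r
library(dplyr)

# Toy data (made-up values), including one missing sex
toy <- tibble(
  island = c("Dream", "Dream", "Biscoe", "Torgersen"),
  sex    = c("MALE", "FEMALE", "MALE", NA)
)

toy_new <- mutate(
  toy,
  group = case_when(
    island == "Dream" & sex == "MALE"   ~ "Dream male",
    island == "Dream" & sex == "FEMALE" ~ "Dream female",
    TRUE                                ~ "other"  # everything else
  )
)

toy_new$group  # "Dream male" "Dream female" "other" "other"
```

The tests are checked from top to bottom and the first one that is TRUE wins. Without the final TRUE line, rows that match no test would get NA.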

rename()

rename() changes only the variable name (written as new_name = old_name) and leaves everything else intact:

rename(penguins_clean, sample = sample_number)
## # A tibble: 344 x 16
##    study_name sample species    latin_name island individual_id clutch_completi~
##    <chr>       <dbl> <chr>      <chr>      <chr>  <chr>         <chr>           
##  1 PAL0708         1 Adelie Pe~ Pygosceli~ Torge~ N1A1          Yes             
##  2 PAL0708         2 Adelie Pe~ Pygosceli~ Torge~ N1A2          Yes             
##  3 PAL0708         3 Adelie Pe~ Pygosceli~ Torge~ N2A1          Yes             
##  4 PAL0708         4 Adelie Pe~ Pygosceli~ Torge~ N2A2          Yes             
##  5 PAL0708         5 Adelie Pe~ Pygosceli~ Torge~ N3A1          Yes             
##  6 PAL0708         6 Adelie Pe~ Pygosceli~ Torge~ N3A2          Yes             
##  7 PAL0708         7 Adelie Pe~ Pygosceli~ Torge~ N4A1          No              
##  8 PAL0708         8 Adelie Pe~ Pygosceli~ Torge~ N4A2          No              
##  9 PAL0708         9 Adelie Pe~ Pygosceli~ Torge~ N5A1          Yes             
## 10 PAL0708        10 Adelie Pe~ Pygosceli~ Torge~ N5A2          Yes             
## # ... with 334 more rows, and 9 more variables: date_egg <date>,
## #   culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
## #   body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
## #   comments <chr>

arrange()

You can order your data to show the highest or lowest value first.

Let’s order by flipper_length_mm.

Lowest first:

arrange(penguins_clean, flipper_length_mm)
## # A tibble: 344 x 16
##    study_name sample_number species           latin_name    island individual_id
##    <chr>              <dbl> <chr>             <chr>         <chr>  <chr>        
##  1 PAL0708               29 Adelie Penguin    Pygoscelis a~ Biscoe N18A1        
##  2 PAL0708               21 Adelie Penguin    Pygoscelis a~ Biscoe N11A1        
##  3 PAL0910              123 Adelie Penguin    Pygoscelis a~ Torge~ N67A1        
##  4 PAL0708               31 Adelie Penguin    Pygoscelis a~ Dream  N21A1        
##  5 PAL0708               32 Adelie Penguin    Pygoscelis a~ Dream  N21A2        
##  6 PAL0809               99 Adelie Penguin    Pygoscelis a~ Dream  N50A1        
##  7 PAL0708                7 Chinstrap penguin Pygoscelis a~ Dream  N66A1        
##  8 PAL0708               48 Adelie Penguin    Pygoscelis a~ Dream  N29A2        
##  9 PAL0708               12 Adelie Penguin    Pygoscelis a~ Torge~ N6A2         
## 10 PAL0708               22 Adelie Penguin    Pygoscelis a~ Biscoe N11A2        
## # ... with 334 more rows, and 10 more variables: clutch_completion <chr>,
## #   date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## #   flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## #   delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>

Highest first using desc() (for descending):

arrange(penguins_clean, desc(flipper_length_mm))
## # A tibble: 344 x 16
##    study_name sample_number species        latin_name       island individual_id
##    <chr>              <dbl> <chr>          <chr>            <chr>  <chr>        
##  1 PAL0809               64 Gentoo penguin Pygoscelis papua Biscoe N19A2        
##  2 PAL0708                2 Gentoo penguin Pygoscelis papua Biscoe N31A2        
##  3 PAL0708               34 Gentoo penguin Pygoscelis papua Biscoe N56A2        
##  4 PAL0809               66 Gentoo penguin Pygoscelis papua Biscoe N20A2        
##  5 PAL0809               76 Gentoo penguin Pygoscelis papua Biscoe N56A2        
##  6 PAL0910               90 Gentoo penguin Pygoscelis papua Biscoe N14A2        
##  7 PAL0910              114 Gentoo penguin Pygoscelis papua Biscoe N34A2        
##  8 PAL0910              116 Gentoo penguin Pygoscelis papua Biscoe N35A2        
##  9 PAL0809               68 Gentoo penguin Pygoscelis papua Biscoe N51A2        
## 10 PAL0910              112 Gentoo penguin Pygoscelis papua Biscoe N32A2        
## # ... with 334 more rows, and 10 more variables: clutch_completion <chr>,
## #   date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## #   flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## #   delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
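arrange() also accepts more than one variable, in which case later variables break ties in earlier ones. A minimal sketch on a made-up tibble (toy and its values are assumptions for illustration):

```r
library(dplyr)

# Toy data (made-up values)
toy <- tibble(
  species           = c("Gentoo", "Adelie", "Gentoo", "Adelie"),
  flipper_length_mm = c(230, 181, 215, 190)
)

# Sort by species (A to Z); within each species, longest flipper first
sorted <- arrange(toy, species, desc(flipper_length_mm))

sorted$species            # "Adelie" "Adelie" "Gentoo" "Gentoo"
sorted$flipper_length_mm  # 190 181 230 215
```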

group_by() and summarize()

When you want to aggregate your data (by groups)

Sometimes we want to calculate group statistics.

In other languages this is often a pain.

With dplyr this is fairly easy and readable.

Let’s calculate the average culmen_length_mm for each sex.

First we group penguins_clean by sex.

grouped_by_sex <- group_by(penguins_clean, sex)

summarize (also spelled summarise; both spellings work) takes statements in the same form as mutate:

variable_name = some_calculation

summarise(grouped_by_sex, avg_culmen_length = mean(culmen_length_mm, na.rm = T))
## # A tibble: 3 x 2
##   sex    avg_culmen_length
##   <chr>              <dbl>
## 1 FEMALE              42.1
## 2 MALE                45.9
## 3 <NA>                41.3

We could also keep the data structure by using mutate on a grouped dataset:

mutate(grouped_by_sex, avg_culmen_length = mean(culmen_length_mm, na.rm = T))
## # A tibble: 344 x 17
## # Groups:   sex [3]
##    study_name sample_number species        latin_name       island individual_id
##    <chr>              <dbl> <chr>          <chr>            <chr>  <chr>        
##  1 PAL0708                1 Adelie Penguin Pygoscelis adel~ Torge~ N1A1         
##  2 PAL0708                2 Adelie Penguin Pygoscelis adel~ Torge~ N1A2         
##  3 PAL0708                3 Adelie Penguin Pygoscelis adel~ Torge~ N2A1         
##  4 PAL0708                4 Adelie Penguin Pygoscelis adel~ Torge~ N2A2         
##  5 PAL0708                5 Adelie Penguin Pygoscelis adel~ Torge~ N3A1         
##  6 PAL0708                6 Adelie Penguin Pygoscelis adel~ Torge~ N3A2         
##  7 PAL0708                7 Adelie Penguin Pygoscelis adel~ Torge~ N4A1         
##  8 PAL0708                8 Adelie Penguin Pygoscelis adel~ Torge~ N4A2         
##  9 PAL0708                9 Adelie Penguin Pygoscelis adel~ Torge~ N5A1         
## 10 PAL0708               10 Adelie Penguin Pygoscelis adel~ Torge~ N5A2         
## # ... with 334 more rows, and 11 more variables: clutch_completion <chr>,
## #   date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## #   flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## #   delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>,
## #   avg_culmen_length <dbl>

Once we are done with group_by we should ungroup our data again.

next_data <- ungroup(grouped_by_sex)
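To make the grouping steps easier to follow, here is a small sketch on a made-up tibble (toy and its values are assumptions) that computes two statistics per group and then checks whether the grouping is still attached:

```r
library(dplyr)

# Toy data (made-up values), with missing values in both columns
toy <- tibble(
  sex              = c("MALE", "MALE", "FEMALE", "FEMALE", NA),
  culmen_length_mm = c(40, 46, 38, NA, 41)
)

grouped <- group_by(toy, sex)

# One row per group; na.rm = TRUE drops missing values before averaging
by_sex <- summarise(
  grouped,
  n_penguins        = n(),
  avg_culmen_length = mean(culmen_length_mm, na.rm = TRUE)
)
by_sex

# is_grouped_df() reports whether a grouping is attached to the data
is_grouped_df(grouped)           # TRUE
is_grouped_df(ungroup(grouped))  # FALSE
```

Here n() counts the rows in each group, so FEMALE still counts two penguins even though one culmen length is missing.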

count()

Now this is a function that I use all the time.

This function helps you count how often each value occurs within one or more variables.

Simply specify which variable you want to count.

Let’s count how often each species occurs.

count(penguins_clean, species, sort = T) 
## # A tibble: 3 x 2
##   species               n
##   <chr>             <int>
## 1 Adelie Penguin      152
## 2 Gentoo penguin      124
## 3 Chinstrap penguin    68

The sort = T tells the function to sort the result so that the most frequent value comes first.
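One way to think about count() is as a shortcut for grouping, summarising and sorting. The sketch below spells this out on a made-up tibble (toy is an assumption for illustration); it is meant as a mental model rather than a description of how count() is implemented:

```r
library(dplyr)

# Toy data (made-up values)
toy <- tibble(
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo")
)

# The shortcut
counted <- count(toy, species, sort = TRUE)

# Roughly the same result, written out with verbs from earlier sections
grouped <- group_by(toy, species)
by_hand <- summarise(grouped, n = n())
by_hand <- arrange(by_hand, desc(n))

counted$species  # "Adelie" "Gentoo" "Chinstrap"
counted$n        # 3 2 1
```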

The %>% operator

The point of the pipe is to help you write code in a way that is easier to read and understand.

Let’s consider an example with some data manipulation we have done so far:

## first I select variables
pg <- select(penguins_clean, individual_id, island, body_mass_g)

## then I filter to only Dream island
pg <- filter(pg, island == "Dream")

## then I convert body_mass_g to kg
pg <- mutate(pg, bodymass_kg = body_mass_g/1000)

## rename individual id to simply id
pg <- rename(pg, id = individual_id)

Now this works but the problem is: we have to write a lot of code that repeats itself!

pg
## # A tibble: 124 x 4
##    id    island body_mass_g bodymass_kg
##    <chr> <chr>        <dbl>       <dbl>
##  1 N21A1 Dream         3250        3.25
##  2 N21A2 Dream         3900        3.9 
##  3 N22A1 Dream         3300        3.3 
##  4 N22A2 Dream         3900        3.9 
##  5 N23A1 Dream         3325        3.32
##  6 N23A2 Dream         4150        4.15
##  7 N24A1 Dream         3950        3.95
##  8 N24A2 Dream         3550        3.55
##  9 N25A1 Dream         3300        3.3 
## 10 N25A2 Dream         4650        4.65
## # ... with 114 more rows

Another alternative is to nest all the functions:

rename(mutate(filter(select(penguins_clean, individual_id, island, body_mass_g), island == "Dream"), bodymass_kg = body_mass_g/1000), id = individual_id)
## # A tibble: 124 x 4
##    id    island body_mass_g bodymass_kg
##    <chr> <chr>        <dbl>       <dbl>
##  1 N21A1 Dream         3250        3.25
##  2 N21A2 Dream         3900        3.9 
##  3 N22A1 Dream         3300        3.3 
##  4 N22A2 Dream         3900        3.9 
##  5 N23A1 Dream         3325        3.32
##  6 N23A2 Dream         4150        4.15
##  7 N24A1 Dream         3950        3.95
##  8 N24A2 Dream         3550        3.55
##  9 N25A1 Dream         3300        3.3 
## 10 N25A2 Dream         4650        4.65
## # ... with 114 more rows

But that’s extremely tough to read and understand!

The piping style:

Read the code from top to bottom and from left to right, and read the %>% as “and then”.

Data first, data once

penguins_clean %>% 
  select(individual_id, island, body_mass_g) %>% 
  filter(island == "Dream") %>% 
  mutate(bodymass_kg = body_mass_g/1000) %>% 
  rename(id = individual_id)
## # A tibble: 124 x 4
##    id    island body_mass_g bodymass_kg
##    <chr> <chr>        <dbl>       <dbl>
##  1 N21A1 Dream         3250        3.25
##  2 N21A2 Dream         3900        3.9 
##  3 N22A1 Dream         3300        3.3 
##  4 N22A2 Dream         3900        3.9 
##  5 N23A1 Dream         3325        3.32
##  6 N23A2 Dream         4150        4.15
##  7 N24A1 Dream         3950        3.95
##  8 N24A2 Dream         3550        3.55
##  9 N25A1 Dream         3300        3.3 
## 10 N25A2 Dream         4650        4.65
## # ... with 114 more rows
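What the pipe does can be sketched in one comparison: the object on the left of %>% becomes the first argument of the function on the right. The tibble toy below is made up for illustration:

```r
library(dplyr)

# Toy data (made-up values)
toy <- tibble(
  island      = c("Dream", "Biscoe"),
  body_mass_g = c(3250, 4500)
)

# x %>% f(y) is evaluated as f(x, y)
piped  <- toy %>% filter(island == "Dream")
direct <- filter(toy, island == "Dream")

# Both calls keep the same single row
piped
direct
```

This is why the data only has to be named once at the top of a pipe chain.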

group_by() again

Grouping also becomes easier using pipes.

Let’s try again to calculate the average culmen_length_mm for each sex but this time with pipes.

penguins_clean %>% 
  group_by(sex) %>% 
  summarise(avg_culmen_length = mean(culmen_length_mm, na.rm = T)) %>%
  ungroup()
## # A tibble: 3 x 2
##   sex    avg_culmen_length
##   <chr>              <dbl>
## 1 FEMALE              42.1
## 2 MALE                45.9
## 3 <NA>                41.3

Small Note on the Pipe

Since R version 4.1.0, base R also provides its own pipe.

It looks like this: |>

While it shares many similarities with the %>%, there are also some differences.

Those differences are beyond the scope of this workshop, so for the sake of simplicity we will stick with the magrittr pipe.
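For simple chains like the ones in this workshop, the two pipes can be swapped one for one. A minimal sketch, assuming R 4.1.0 or newer and a made-up tibble called toy:

```r
library(dplyr)

# Toy data (made-up values)
toy <- tibble(
  island      = c("Dream", "Biscoe", "Dream"),
  body_mass_g = c(3250, 4500, 4650)
)

# magrittr pipe, as used throughout this workshop
with_magrittr <- toy %>%
  filter(island == "Dream") %>%
  mutate(bodymass_kg = body_mass_g / 1000)

# Base R pipe (needs R >= 4.1.0)
with_native <- toy |>
  filter(island == "Dream") |>
  mutate(bodymass_kg = body_mass_g / 1000)

with_magrittr$bodymass_kg  # 3.25 4.65
with_native$bodymass_kg    # 3.25 4.65
```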

It’s time to type some R code

Open 04_exercises_II.Rmd